
Loan Approval Modeling and Predictive Analytics

1 Introduction

In today’s rapidly evolving financial sector, the efficient management of loan repayment is crucial for both lending institutions and borrowers. This project represents a strategic initiative to apply the latest in data science and machine learning to enhance the loan approval process. Our data-driven approach utilizes an extensive dataset, encompassing the profiles of over 300,000 loan applicants across 122 variables, to predict loan approval outcomes with a focus on repayment history. The aim is to create a predictive model that not only interprets applicant data but also offers actionable insights for lenders.

The primary objective of our research is to thoroughly analyze applicant characteristics and financial indicators within our dataset. This involves identifying patterns in the repayment history of borrowers, especially those in default, to develop predictive models. These models are designed to be more than statistical tools; they serve as strategic assets that can potentially streamline the loan approval process, enhancing efficiency and accuracy.

Our ambition goes beyond traditional data analysis. We are seeking to transform the way lending institutions make decisions, providing a deeper and more comprehensive risk assessment tool. The goal is to facilitate a shift in loan approval processes, reducing associated risks and improving the overall lending experience for all parties involved. This project is not just about handling data; it’s about innovating in the realm of financial decision-making.

The dataset chosen for this project includes critical variables that shed light on various aspects of an applicant’s profile, such as contract type, gender, car ownership, real estate ownership, and more. Analyzing these variables is crucial for our predictive modeling, offering insights into the diverse backgrounds and circumstances of loan applicants. This analysis is the cornerstone of our approach, aiming to enhance the accuracy and relevance of our loan approval predictions.

Reading the Dataset

library(data.table)  # fread()
library(tidyverse)   # read_csv(), %>%, glimpse()

# Treat common placeholder strings ("XNA", "?", empty strings, ...) as NA while reading
train <- fread("application_train.csv", na.strings = c("NA", "NaN", "?", "", "XNA"))
test  <- fread("application_test.csv",  na.strings = c("NA", "NaN", "?", "", "XNA"), showProgress = FALSE)
dt1 <- fread("application_train.csv", showProgress = FALSE)  # raw copy, no NA recoding
app <- read_csv("application_train.csv") %>% mutate(TARGET = as.factor(TARGET))  # TARGET as factor
glimpse(train)
## Rows: 307,511
## Columns: 122
## $ SK_ID_CURR                   <int> 100002, 100003, 100004, 100006, 100007, 1…
## $ TARGET                       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ NAME_CONTRACT_TYPE           <chr> "Cash loans", "Cash loans", "Revolving lo…
## $ CODE_GENDER                  <chr> "M", "F", "M", "F", "M", "M", "F", "M", "…
## $ FLAG_OWN_CAR                 <chr> "N", "N", "Y", "N", "N", "N", "Y", "Y", "…
## $ FLAG_OWN_REALTY              <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ CNT_CHILDREN                 <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ AMT_INCOME_TOTAL             <dbl> 202500.00, 270000.00, 67500.00, 135000.00…
## $ AMT_CREDIT                   <dbl> 406597.5, 1293502.5, 135000.0, 312682.5, …
## $ AMT_ANNUITY                  <dbl> 24700.5, 35698.5, 6750.0, 29686.5, 21865.…
## $ AMT_GOODS_PRICE              <dbl> 351000, 1129500, 135000, 297000, 513000, …
## $ NAME_TYPE_SUITE              <chr> "Unaccompanied", "Family", "Unaccompanied…
## $ NAME_INCOME_TYPE             <chr> "Working", "State servant", "Working", "W…
## $ NAME_EDUCATION_TYPE          <chr> "Secondary / secondary special", "Higher …
## $ NAME_FAMILY_STATUS           <chr> "Single / not married", "Married", "Singl…
## $ NAME_HOUSING_TYPE            <chr> "House / apartment", "House / apartment",…
## $ REGION_POPULATION_RELATIVE   <dbl> 0.018801, 0.003541, 0.010032, 0.008019, 0…
## $ DAYS_BIRTH                   <int> -9461, -16765, -19046, -19005, -19932, -1…
## $ DAYS_EMPLOYED                <int> -637, -1188, -225, -3039, -3038, -1588, -…
## $ DAYS_REGISTRATION            <dbl> -3648, -1186, -4260, -9833, -4311, -4970,…
## $ DAYS_ID_PUBLISH              <int> -2120, -291, -2531, -2437, -3458, -477, -…
## $ OWN_CAR_AGE                  <dbl> NA, NA, 26, NA, NA, NA, 17, 8, NA, NA, NA…
## $ FLAG_MOBIL                   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_EMP_PHONE               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,…
## $ FLAG_WORK_PHONE              <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,…
## $ FLAG_CONT_MOBILE             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_PHONE                   <int> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,…
## $ FLAG_EMAIL                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ OCCUPATION_TYPE              <chr> "Laborers", "Core staff", "Laborers", "La…
## $ CNT_FAM_MEMBERS              <dbl> 1, 2, 1, 2, 1, 2, 3, 2, 2, 1, 3, 2, 2, 3,…
## $ REGION_RATING_CLIENT         <int> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,…
## $ REGION_RATING_CLIENT_W_CITY  <int> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,…
## $ WEEKDAY_APPR_PROCESS_START   <chr> "WEDNESDAY", "MONDAY", "MONDAY", "WEDNESD…
## $ HOUR_APPR_PROCESS_START      <int> 10, 11, 9, 17, 11, 16, 16, 16, 14, 8, 15,…
## $ REG_REGION_NOT_LIVE_REGION   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ REG_REGION_NOT_WORK_REGION   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ LIVE_REGION_NOT_WORK_REGION  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ REG_CITY_NOT_LIVE_CITY       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ REG_CITY_NOT_WORK_CITY       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ LIVE_CITY_NOT_WORK_CITY      <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ORGANIZATION_TYPE            <chr> "Business Entity Type 3", "School", "Gove…
## $ EXT_SOURCE_1                 <dbl> 0.08303697, 0.31126731, NA, NA, NA, NA, 0…
## $ EXT_SOURCE_2                 <dbl> 0.2629486, 0.6222458, 0.5559121, 0.650441…
## $ EXT_SOURCE_3                 <dbl> 0.13937578, NA, 0.72956669, NA, NA, 0.621…
## $ APARTMENTS_AVG               <dbl> 0.0247, 0.0959, NA, NA, NA, NA, NA, NA, N…
## $ BASEMENTAREA_AVG             <dbl> 0.0369, 0.0529, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BEGINEXPLUATATION_AVG  <dbl> 0.9722, 0.9851, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BUILD_AVG              <dbl> 0.6192, 0.7960, NA, NA, NA, NA, NA, NA, N…
## $ COMMONAREA_AVG               <dbl> 0.0143, 0.0605, NA, NA, NA, NA, NA, NA, N…
## $ ELEVATORS_AVG                <dbl> 0.00, 0.08, NA, NA, NA, NA, NA, NA, NA, N…
## $ ENTRANCES_AVG                <dbl> 0.0690, 0.0345, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMAX_AVG                <dbl> 0.0833, 0.2917, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMIN_AVG                <dbl> 0.1250, 0.3333, NA, NA, NA, NA, NA, NA, N…
## $ LANDAREA_AVG                 <dbl> 0.0369, 0.0130, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAPARTMENTS_AVG         <dbl> 0.0202, 0.0773, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAREA_AVG               <dbl> 0.0190, 0.0549, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAPARTMENTS_AVG      <dbl> 0.0000, 0.0039, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAREA_AVG            <dbl> 0.0000, 0.0098, NA, NA, NA, NA, NA, NA, N…
## $ APARTMENTS_MODE              <dbl> 0.0252, 0.0924, NA, NA, NA, NA, NA, NA, N…
## $ BASEMENTAREA_MODE            <dbl> 0.0383, 0.0538, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BEGINEXPLUATATION_MODE <dbl> 0.9722, 0.9851, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BUILD_MODE             <dbl> 0.6341, 0.8040, NA, NA, NA, NA, NA, NA, N…
## $ COMMONAREA_MODE              <dbl> 0.0144, 0.0497, NA, NA, NA, NA, NA, NA, N…
## $ ELEVATORS_MODE               <dbl> 0.0000, 0.0806, NA, NA, NA, NA, NA, NA, N…
## $ ENTRANCES_MODE               <dbl> 0.0690, 0.0345, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMAX_MODE               <dbl> 0.0833, 0.2917, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMIN_MODE               <dbl> 0.1250, 0.3333, NA, NA, NA, NA, NA, NA, N…
## $ LANDAREA_MODE                <dbl> 0.0377, 0.0128, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAPARTMENTS_MODE        <dbl> 0.0220, 0.0790, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAREA_MODE              <dbl> 0.0198, 0.0554, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAPARTMENTS_MODE     <dbl> 0.0000, 0.0000, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAREA_MODE           <dbl> 0.0000, 0.0000, NA, NA, NA, NA, NA, NA, N…
## $ APARTMENTS_MEDI              <dbl> 0.0250, 0.0968, NA, NA, NA, NA, NA, NA, N…
## $ BASEMENTAREA_MEDI            <dbl> 0.0369, 0.0529, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BEGINEXPLUATATION_MEDI <dbl> 0.9722, 0.9851, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BUILD_MEDI             <dbl> 0.6243, 0.7987, NA, NA, NA, NA, NA, NA, N…
## $ COMMONAREA_MEDI              <dbl> 0.0144, 0.0608, NA, NA, NA, NA, NA, NA, N…
## $ ELEVATORS_MEDI               <dbl> 0.00, 0.08, NA, NA, NA, NA, NA, NA, NA, N…
## $ ENTRANCES_MEDI               <dbl> 0.0690, 0.0345, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMAX_MEDI               <dbl> 0.0833, 0.2917, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMIN_MEDI               <dbl> 0.1250, 0.3333, NA, NA, NA, NA, NA, NA, N…
## $ LANDAREA_MEDI                <dbl> 0.0375, 0.0132, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAPARTMENTS_MEDI        <dbl> 0.0205, 0.0787, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAREA_MEDI              <dbl> 0.0193, 0.0558, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAPARTMENTS_MEDI     <dbl> 0.0000, 0.0039, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAREA_MEDI           <dbl> 0.0000, 0.0100, NA, NA, NA, NA, NA, NA, N…
## $ FONDKAPREMONT_MODE           <chr> "reg oper account", "reg oper account", N…
## $ HOUSETYPE_MODE               <chr> "block of flats", "block of flats", NA, N…
## $ TOTALAREA_MODE               <dbl> 0.0149, 0.0714, NA, NA, NA, NA, NA, NA, N…
## $ WALLSMATERIAL_MODE           <chr> "Stone, brick", "Block", NA, NA, NA, NA, …
## $ EMERGENCYSTATE_MODE          <chr> "No", "No", NA, NA, NA, NA, NA, NA, NA, N…
## $ OBS_30_CNT_SOCIAL_CIRCLE     <dbl> 2, 1, 0, 2, 0, 0, 1, 2, 1, 2, 0, 0, 0, 0,…
## $ DEF_30_CNT_SOCIAL_CIRCLE     <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ OBS_60_CNT_SOCIAL_CIRCLE     <dbl> 2, 1, 0, 2, 0, 0, 1, 2, 1, 2, 0, 0, 0, 0,…
## $ DEF_60_CNT_SOCIAL_CIRCLE     <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DAYS_LAST_PHONE_CHANGE       <dbl> -1134, -828, -815, -617, -1106, -2536, -1…
## $ FLAG_DOCUMENT_2              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_3              <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
## $ FLAG_DOCUMENT_4              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_5              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_6              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ FLAG_DOCUMENT_7              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_8              <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_9              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_10             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_11             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_12             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_13             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_14             <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_15             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_16             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_17             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_18             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_19             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_20             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_21             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AMT_REQ_CREDIT_BUREAU_HOUR   <dbl> 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_DAY    <dbl> 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_WEEK   <dbl> 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_MON    <dbl> 0, 0, 0, NA, 0, 0, 1, 0, 0, NA, 1, 0, 1, …
## $ AMT_REQ_CREDIT_BUREAU_QRT    <dbl> 0, 0, 0, NA, 0, 1, 1, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_YEAR   <dbl> 1, 0, 0, NA, 0, 1, 2, 0, 1, NA, 0, 2, 0, …
cat("application_train : (" , nrow(train) , "," , ncol(train) , ")\n")
## application_train : ( 307511 , 122 )
col_names <- colnames(train)
na_counts <- colSums(is.na(train)) %>% as.data.table()

missing_value_table <- cbind(col_names, na_counts)

colnames(missing_value_table) <- c("variables", "Missing_values")

missing_value_table <- missing_value_table %>%
                        filter(Missing_values > 0) %>%
                        mutate("% of Total Values" = round(100 * (Missing_values / nrow(train)), 1)) %>%
                        arrange(desc(Missing_values))
cat("Your selected dataframe has" , ncol(train) , " columns.\n")
## Your selected dataframe has 122  columns.
cat("There are" , nrow(missing_value_table) , "columns that have missing values.")
## There are 69 columns that have missing values.
plot_intro(train)

In our analysis of the application_train dataset, comprising 307,511 rows and 122 columns, we identified 69 columns with missing values. Discrete and continuous columns constitute 13.1% and 86.9% of the columns, respectively. No column is entirely missing, but only 2.8% of rows are complete, and 24.5% of all individual values (cells) are missing. These findings offer a holistic view of the dataset’s composition and highlight the areas requiring attention before modeling.

head(missing_value_table,15)
##                    variables Missing_values % of Total Values
##  1:           COMMONAREA_AVG         214865              69.9
##  2:          COMMONAREA_MODE         214865              69.9
##  3:          COMMONAREA_MEDI         214865              69.9
##  4:  NONLIVINGAPARTMENTS_AVG         213514              69.4
##  5: NONLIVINGAPARTMENTS_MODE         213514              69.4
##  6: NONLIVINGAPARTMENTS_MEDI         213514              69.4
##  7:       FONDKAPREMONT_MODE         210295              68.4
##  8:     LIVINGAPARTMENTS_AVG         210199              68.4
##  9:    LIVINGAPARTMENTS_MODE         210199              68.4
## 10:    LIVINGAPARTMENTS_MEDI         210199              68.4
## 11:            FLOORSMIN_AVG         208642              67.8
## 12:           FLOORSMIN_MODE         208642              67.8
## 13:           FLOORSMIN_MEDI         208642              67.8
## 14:          YEARS_BUILD_AVG         204488              66.5
## 15:         YEARS_BUILD_MODE         204488              66.5

Examining the Target Variable

ggplot(train, aes(TARGET)) +
    geom_bar(fill = "blue", alpha = 0.3) +
    geom_text(aes(label = scales::percent(after_stat(count) / sum(after_stat(count)))),
              stat = "count", vjust = -0.5) +
    scale_y_continuous(label = comma)

The bar plot of the TARGET variable indicates a highly imbalanced classification problem: 92% of loans fall into the timely-repayment category (TARGET = 0) and only 8% represent defaults (TARGET = 1). This imbalance must be addressed for effective model training. To mitigate it, we assigned appropriate weights to the classes during the training phase, ensuring that the model is not skewed toward the majority class and can make accurate predictions for both repayment outcomes.
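As a sketch of how such weights can be derived (the exact weighting scheme used during training may differ), inverse-frequency class weights can be computed from the TARGET counts and expanded to one weight per observation:

```r
# Inverse-frequency class weights: the rarer class (default) receives a
# proportionally larger weight (sketch; the actual training weights may differ)
class_counts  <- table(train$TARGET)
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)

# Expand to one weight per row, e.g. for glm(..., weights = obs_weights)
obs_weights <- as.numeric(class_weights[as.character(train$TARGET)])
```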

# Collect the character (categorical) columns, then re-attach the
# TARGET and ID columns needed by the plotting helper
train %>%
    select_if(is.character) -> carData

colNames <- names(carData)
carData$TARGET <- train$TARGET
carData$SK_ID_CURR <- train$SK_ID_CURR

# distColumn() is a project plotting helper (defined elsewhere) that
# shows repayment outcomes broken down by one categorical column
for (name in colNames){
  p <- distColumn(train, name, 'TARGET', 'SK_ID_CURR')
  plot(p)
}

The visual analysis of loan categories based on applicant characteristics reveals several key observations:

Cash Loans Dominance: The majority of loans taken were in the form of Cash Loans, and a significant portion of these loans were repaid on time.

Occupational Impact: Different occupational groups exhibit varying repayment behaviors. Working individuals, state servants, and commercial associates took out a considerable number of loans, with default rates of roughly 5% or less. Laborers secured the highest number of loans overall, and the majority of those loans were repaid on time.

Sector Disparities: Loans were less frequent among individuals in IT and HR sectors, suggesting a lower demand or eligibility within these professional domains.

The graphical representations provide a succinct overview, presenting both the percentage and count of loans taken, along with a breakdown of timely repayments and defaults. These insights can guide strategic decision-making for lending institutions, aiding in risk assessment and resource allocation.

train=train  %>% mutate_if(is.character, list(~factor(.)))
train=train  %>% mutate_if(is.integer, list(~as.numeric(.)))
train  %>% select_if(is.factor)  %>% summarise_all(n_distinct)  %>% t()
##                            [,1]
## NAME_CONTRACT_TYPE            2
## CODE_GENDER                   3
## FLAG_OWN_CAR                  2
## FLAG_OWN_REALTY               2
## NAME_TYPE_SUITE               8
## NAME_INCOME_TYPE              8
## NAME_EDUCATION_TYPE           5
## NAME_FAMILY_STATUS            6
## NAME_HOUSING_TYPE             6
## OCCUPATION_TYPE              19
## WEEKDAY_APPR_PROCESS_START    7
## ORGANIZATION_TYPE            58
## FONDKAPREMONT_MODE            5
## HOUSETYPE_MODE                4
## WALLSMATERIAL_MODE            8
## EMERGENCYSTATE_MODE           3
summary(train[,'DAYS_BIRTH'])
##    DAYS_BIRTH    
##  Min.   :-25229  
##  1st Qu.:-19682  
##  Median :-15750  
##  Mean   :-16037  
##  3rd Qu.:-12413  
##  Max.   : -7489
summary(train[,'DAYS_BIRTH'] /-365)
##    DAYS_BIRTH   
##  Min.   :20.52  
##  1st Qu.:34.01  
##  Median :43.15  
##  Mean   :43.94  
##  3rd Qu.:53.92  
##  Max.   :69.12
summary(train[,'DAYS_EMPLOYED'])
##  DAYS_EMPLOYED   
##  Min.   :-17912  
##  1st Qu.: -2760  
##  Median : -1213  
##  Mean   : 63815  
##  3rd Qu.:  -289  
##  Max.   :365243
ggplot(train,aes(DAYS_EMPLOYED))+
    geom_histogram(fill = "blue",alpha=0.3)+
    scale_x_continuous(label = comma)+
    scale_y_continuous(label = comma) 

anom= train  %>% filter(DAYS_EMPLOYED==365243)
non_anom= train %>% filter(DAYS_EMPLOYED!=365243)

sprintf('There are %d rows with anomalous days of employment' , nrow(anom) )
## [1] "There are 55374 rows with anomalous days of employment"

In our analysis, we discovered a group of anomalous values in the “DAYS_EMPLOYED” column where all instances had the sentinel value 365,243 (roughly 1,000 years of employment). Surprisingly, loans associated with these anomalies exhibited a lower default rate (5.40%) than non-anomalous cases (8.66%). Given this finding, handling the anomalies carefully is crucial. We opted for a cautious approach: the anomalous values were replaced with NA, and a new boolean column was introduced to record the original anomaly status. This preprocessing step brings the employment-days distribution closer to expectations. The NA values can later be imputed, while the flag column preserves the fact that the value was anomalous, helping the model exploit this information and paving the way for more accurate predictions.
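The anomaly handling described above can be sketched in two lines: flag the sentinel value, then blank it out.

```r
# Record which rows carried the sentinel value, then replace it with NA so
# the histogram below reflects only plausible employment durations
train$DAYS_EMPLOYED_ANOM <- train$DAYS_EMPLOYED == 365243
train$DAYS_EMPLOYED[train$DAYS_EMPLOYED_ANOM] <- NA
```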

ggplot(train,aes(DAYS_EMPLOYED))+
    geom_histogram(fill = "blue",alpha=0.3)+
    scale_x_continuous(label = comma)+
    scale_y_continuous(label = comma)+
    ggtitle("Days Employment Histogram")

Effect of Age on Repayment

train[,'DAYS_BIRTH'] = abs(train[,'DAYS_BIRTH'])

cor(train[,'DAYS_BIRTH'],train[,'TARGET'])
##                 TARGET
## DAYS_BIRTH -0.07823931

The “DAYS_BIRTH” variable shows one of the strongest correlations with the target among the raw features. It records the client’s age at the time of the loan in negative days, so its raw correlation with TARGET is positive, which simply means that older clients (larger absolute values of “DAYS_BIRTH”) are less likely to default. To make the direction easier to read, we transformed the variable to its absolute value, yielding a correlation of approximately -0.078: in general, as clients age, they tend to repay their loans on time more frequently. Visualizing this relationship through a histogram of client ages in years gives a clearer picture of the age-related dynamics in repayment behavior.

ggplot(train, aes(DAYS_BIRTH /365))+
    geom_histogram(fill = "blue",alpha=0.3,bins = 25)+
    scale_x_continuous(label = comma)+
    scale_y_continuous(label = comma)+
    theme_light() + 
    labs(title = "Age of Client", x = "Age (years)")+
    theme(plot.title = element_text(hjust = .5))

On its own, analyzing the age distribution provides limited insights, mainly confirming the absence of outliers as all recorded ages appear reasonable. To gain a deeper understanding of how age influences the target variable, we employed a Kernel Density Estimation plot (KDE) that incorporates color differentiation based on the target values.

A Kernel Density Estimate plot is a valuable visualization tool that illustrates the distribution of a single variable, akin to a smoothed histogram. It is generated by placing a kernel at each data point and averaging the individual kernels to produce a single, smooth curve.

In the forthcoming analysis, we will leverage the ggplot KDE plot to create an insightful graph that sheds light on the relationship between age and the target variable. The color-coded KDE plot will provide a nuanced depiction of how age dynamics correlate with the target, offering a more comprehensive view of the impact of age on loan repayment behavior.

ggplot(train, aes(DAYS_BIRTH /365,colour=as.factor(TARGET),group=as.factor(TARGET)))+
    geom_density()+
    scale_x_continuous(label = comma)+
    scale_y_continuous()+
    theme_light() + 
    labs(title = "Distribution of Ages", x = "Age (years)")+
    theme(plot.title = element_text(hjust = .5))

We generated a density plot of ages, color-coded by the target variable. The curve for defaults (TARGET == 1) skews towards the younger age range. Although the correlation coefficient is modest (about -0.078), the variable remains valuable for machine learning models due to its discernible impact on the target.

In a complementary analysis, the age data is binned into 5-year intervals, allowing for a closer examination of the relationship between age and loan repayment. By calculating the average value of the target variable within each age bracket, the graph provides insights into the proportion of loans that were not repaid in different age categories. This approach offers a nuanced perspective on the influence of age on the likelihood of loan repayment failure.

age_data <- train %>%
  select(TARGET, DAYS_BIRTH) %>%
  mutate(YEARS_BIRTH = DAYS_BIRTH / 365)

# Bin the age data
age_data$YEARS_BINNED <- cut(age_data$YEARS_BIRTH, breaks = seq(20, 70, by = 5))
head(age_data, 10)
##     TARGET DAYS_BIRTH YEARS_BIRTH YEARS_BINNED
##  1:      1       9461    25.92055      (25,30]
##  2:      0      16765    45.93151      (45,50]
##  3:      0      19046    52.18082      (50,55]
##  4:      0      19005    52.06849      (50,55]
##  5:      0      19932    54.60822      (50,55]
##  6:      0      16941    46.41370      (45,50]
##  7:      0      13778    37.74795      (35,40]
##  8:      0      18850    51.64384      (50,55]
##  9:      0      20099    55.06575      (55,60]
## 10:      0      14469    39.64110      (35,40]
# Group by the bin and calculate averages
age_data  %>% group_by(YEARS_BINNED)  %>% summarise_all(mean)
## # A tibble: 10 × 4
##    YEARS_BINNED TARGET DAYS_BIRTH YEARS_BIRTH
##    <fct>         <dbl>      <dbl>       <dbl>
##  1 (20,25]      0.123       8533.        23.4
##  2 (25,30]      0.111      10155.        27.8
##  3 (30,35]      0.103      11855.        32.5
##  4 (35,40]      0.0894     13708.        37.6
##  5 (40,45]      0.0785     15498.        42.5
##  6 (45,50]      0.0742     17324.        47.5
##  7 (50,55]      0.0670     19196.        52.6
##  8 (55,60]      0.0553     20984.        57.5
##  9 (60,65]      0.0527     22781.        62.4
## 10 (65,70]      0.0373     24293.        66.6
age_data  %>% group_by(YEARS_BINNED)  %>% summarise_all(mean)  %>% 
    ggplot(aes(x=YEARS_BINNED, y=TARGET*100))+
        geom_col(fill = "blue",alpha=0.3)+
        labs(title = "Failure to Repay by Age Group", x = "Age Group (years)", y= 'Failure to Repay (%)')+
        theme(plot.title = element_text(hjust = .5))

The presented visualization and analysis of the grouped age data clearly highlight a discernible trend: younger applicants exhibit a higher likelihood of failing to repay loans. Specifically, the failure to repay rates surpass 10% for the three youngest age categories, contrasting with rates below 5% for the oldest age group.

This actionable insight carries direct implications for the bank’s decision-making processes. The information suggests that younger clients may benefit from additional support, guidance, or financial planning tips to enhance their repayment capabilities. Importantly, this recommendation is not advocating for discriminatory practices but rather underscores the prudence of implementing precautionary measures to assist younger clients in meeting their payment obligations more effectively. While age alone may not be a decisive factor, incorporating it into machine learning models can contribute to more nuanced credit risk assessments.

Exterior Sources Analysis

The analysis of exterior data sources reveals three variables, namely EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, with the most notable negative correlations with the target variable. As per the documentation, these features represent normalized scores derived from external data sources, potentially constituting a cumulative credit rating based on diverse data inputs.

ext_data = train[,c('TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH')]
ext_data_corrs = cor(ext_data, use = "pairwise")
ext_data_corrs
##                   TARGET EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3  DAYS_BIRTH
## TARGET        1.00000000   -0.1553171  -0.16047167   -0.1789187 -0.07823931
## EXT_SOURCE_1 -0.15531713    1.0000000   0.21398238    0.1868455  0.60060997
## EXT_SOURCE_2 -0.16047167    0.2139824   1.00000000    0.1091673  0.09199587
## EXT_SOURCE_3 -0.17891870    0.1868455   0.10916729    1.0000000  0.20547760
## DAYS_BIRTH   -0.07823931    0.6006100   0.09199587    0.2054776  1.00000000
melted_cormat <- melt(ext_data_corrs,na.rm=TRUE)
head(melted_cormat)
##           Var1         Var2       value
## 1       TARGET       TARGET  1.00000000
## 2 EXT_SOURCE_1       TARGET -0.15531713
## 3 EXT_SOURCE_2       TARGET -0.16047167
## 4 EXT_SOURCE_3       TARGET -0.17891870
## 5   DAYS_BIRTH       TARGET -0.07823931
## 6       TARGET EXT_SOURCE_1 -0.15531713

The correlation matrix illustrates the relationships between these external sources and the target, as well as their intercorrelations. Notably, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 all exhibit negative correlations with the target, suggesting that higher scores from these external sources are associated with a lower likelihood of loan default. Moreover, examining the correlations between the external sources themselves and with other variables like DAYS_BIRTH provides a comprehensive view of their interconnectedness.

These findings imply that these external sources, capturing additional information beyond the immediate loan application, play a crucial role in predicting the likelihood of timely loan repayment. In a practical sense, these external data features can serve as valuable inputs for machine learning models, contributing to more accurate assessments of creditworthiness and aiding in risk mitigation for the lending institution.

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson\nCorrelation") +
  labs(title='Correlation Heatmap')+
  theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle=90),
       axis.title.x=element_blank(),
       axis.title.y=element_blank())+
  coord_fixed()+
  geom_text(aes(Var2, Var1, label = round(value,2)), color = "black", size = 3)

The presented correlation heatmap visually encapsulates the relationships between different variables, particularly highlighting the Pearson correlation coefficients. The three EXT_SOURCE features exhibit negative correlations with the target, implying that higher EXT_SOURCE values are associated with a greater likelihood of loan repayment. Additionally, there’s a positive correlation between DAYS_BIRTH and EXT_SOURCE_1, suggesting that client age might influence this scoring mechanism.

To delve deeper into the impact of these features on loan repayment, the distribution of each variable is explored, color-coded by the target value. This approach enables a more nuanced understanding of how each variable influences the likelihood of loan default. Such insights are crucial for refining credit risk assessment models and tailoring strategies to enhance the overall repayment rates for clients.

p1=ggplot(train, aes(EXT_SOURCE_1,colour=as.factor(TARGET),group=as.factor(TARGET)))+
    geom_density()+
    scale_x_continuous(breaks=seq(0,1,by=0.2))+
    scale_y_continuous()+
    theme_light() + 
    labs(title = "Distribution of EXT_SOURCE_1 by Target Value", x = "EXT_SOURCE_1")+
    theme(plot.title = element_text(hjust = .5))

p2=ggplot(train, aes(EXT_SOURCE_2,colour=as.factor(TARGET),group=as.factor(TARGET)))+
    geom_density()+
    scale_x_continuous(breaks=seq(0,1,by=0.2))+
    scale_y_continuous()+
    theme_light() + 
    labs(title = "Distribution of EXT_SOURCE_2 by Target Value", x = "EXT_SOURCE_2")+
    theme(plot.title = element_text(hjust = .5))
p3=ggplot(train, aes(EXT_SOURCE_3,colour=as.factor(TARGET),group=as.factor(TARGET)))+
    geom_density()+
    scale_x_continuous(breaks=seq(0,1,by=0.2))+
    scale_y_continuous()+
    theme_light() + 
    labs(title = "Distribution of EXT_SOURCE_3 by Target Value", x = "EXT_SOURCE_3")+
    theme(plot.title = element_text(hjust = .5))

grid.arrange(p1,p2,p3,nrow=3)

The visualizations of EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 distributions, stratified by the target value, reveal interesting insights. While all three features exhibit subtle differences between the target values, EXT_SOURCE_3 stands out as having the most pronounced distinction. This suggests that EXT_SOURCE_3 holds a discernible relationship with the likelihood of loan repayment. Despite the correlations being characterized as weak, these variables, including EXT_SOURCE_1 and EXT_SOURCE_2, remain valuable inputs for a machine learning model. Leveraging these features in predictive models can contribute to assessing and forecasting an applicant’s repayment behavior, enhancing the precision of credit risk evaluations.
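Because the three external scores are only weakly correlated with one another, combining them can concentrate their complementary signal. As a purely illustrative sketch (not a step performed in the analysis above), a row-wise mean that ignores missing values could be added as a derived feature:

```r
# Hypothetical derived feature (illustration only): average the three
# external scores per applicant, skipping NAs
train$EXT_SOURCE_MEAN <- rowMeans(
    train[, c("EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3")],
    na.rm = TRUE
)
```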

2 Feature Engineering

In optimizing predictor variables for enhanced machine learning model performance, a systematic feature engineering process was employed. This involved centering and scaling to standardize predictor values, mitigating skewness by applying logarithmic, square root, or inverse transformations to numeric variables, and employing Box-Cox transformation to further enhance non-normal distributions. Lambda estimates were derived for each predictor, indicating the extent of transformation. Pre-processing steps incorporated Box-Cox transformations for specific columns, contributing to improved data normality. The impact of these transformations was visually assessed through histograms, providing a clear before-and-after view. These efforts aim to create a more robust set of predictor variables, potentially boosting the effectiveness of subsequent machine learning models.

#Skew Values
skewValues <- as.data.frame(apply(dt1_num, 2, function(x) skewness(x, na.rm = TRUE)))
colnames(skewValues)[1] <- "skew_values"
skewValues <- index_to_col(skewValues,'Column')
skewValues <- setDT(skewValues)[order (skew_values, decreasing = TRUE)]
top_15 <- head(skewValues, 15)
print(top_15)
##                        Column skew_values
##  1:          FLAG_DOCUMENT_12   392.11095
##  2:          AMT_INCOME_TOTAL   391.55583
##  3:          FLAG_DOCUMENT_10   209.58701
##  4:           FLAG_DOCUMENT_2   153.79032
##  5: AMT_REQ_CREDIT_BUREAU_QRT   134.36426
##  6:           FLAG_DOCUMENT_4   110.89328
##  7:           FLAG_DOCUMENT_7    72.17340
##  8:          FLAG_DOCUMENT_17    61.21354
##  9:          FLAG_DOCUMENT_21    54.61241
## 10:          FLAG_DOCUMENT_20    44.36446
## 11:          FLAG_DOCUMENT_19    40.95573
## 12:          FLAG_DOCUMENT_15    28.69905
## 13: AMT_REQ_CREDIT_BUREAU_DAY    27.04320
## 14:          FLAG_DOCUMENT_14    18.37235
## 15:          FLAG_DOCUMENT_13    16.75358
BoxCoxValues <- apply(dt1_num, 2, function(x) BoxCoxTrans(x, na.rm = TRUE))
x = list()

for (i in 1:ncol(dt1_num)){
     lambda <- BoxCoxValues[[i]][[1]]
     x[[i]] <- lambda
}

lambda = do.call(rbind, x)
lambda_df <- as.data.frame(cbind(colnames(dt1_num),lambda))
colnames(lambda_df)[1] <- "Column"
colnames(lambda_df)[2] <- "lambda"
knitr::kable(setDT(lambda_df)[!is.na(lambda)])
| Column                      | lambda |
|:----------------------------|-------:|
| SK_ID_CURR                  |    0.7 |
| AMT_INCOME_TOTAL            |   -0.1 |
| AMT_CREDIT                  |    0.2 |
| AMT_ANNUITY                 |    0.2 |
| AMT_GOODS_PRICE             |    0.2 |
| REGION_POPULATION_RELATIVE  |    0.3 |
| CNT_FAM_MEMBERS             |    0.2 |
| REGION_RATING_CLIENT        |    1.0 |
| REGION_RATING_CLIENT_W_CITY |    1.0 |
| EXT_SOURCE_1                |    0.9 |
| EXT_SOURCE_2                |    1.5 |
| EXT_SOURCE_3                |    1.2 |
preProcValues <- preProcess(dt1, method = "BoxCox")
preProcValues
## Created from 11351 samples and 28 variables
## 
## Pre-processing:
##   - Box-Cox transformation (12)
##   - ignored (16)
## 
## Lambda estimates for Box-Cox transformation:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1000  0.2000  0.5000  0.6083  1.0000  1.5000
dt1_tran <- predict(preProcValues, dt1)
col_trans <- lambda_df[!is.na(lambda)]$Column
i = 5
x <- list(
  title = as.character(col_trans[i])
)
p1 <- plot_ly(x = ~setDT(dt1)[,get(as.character(col_trans[i]))], type = "histogram", autobinx = FALSE) %>% layout(showlegend = FALSE) 
p2 <- plot_ly(x = ~setDT(dt1_tran)[,get(as.character(col_trans[i]))], type = "histogram", autobinx = FALSE) %>% layout(showlegend = FALSE)
subplot(p1,p2)

Before Transformation

doPlots(as.data.frame(dt1)[, (colnames(dt1) %in% as.character(col_trans))], plotHist, ii = 1:length(col_trans))

After Transformation

doPlots(as.data.frame(dt1_tran)[, (colnames(dt1_tran) %in% as.character(col_trans))], plotHist, ii = 1:length(col_trans))

Removing Predictors

nzv <- nearZeroVar(dt1, saveMetrics = TRUE)
nzv <- index_to_col(nzv, "Column")
nzv_tb <- setDT(nzv)[nzv == TRUE | zeroVar == TRUE]
head(nzv_tb, 15)
##                          Column    freqRatio percentUnique zeroVar  nzv
##  1:               DAYS_EMPLOYED    354.96154  4.0889594193   FALSE TRUE
##  2:                  FLAG_MOBIL 307510.00000  0.0006503832   FALSE TRUE
##  3:            FLAG_CONT_MOBILE    534.73345  0.0006503832   FALSE TRUE
##  4:  REG_REGION_NOT_LIVE_REGION     65.03199  0.0006503832   FALSE TRUE
##  5: LIVE_REGION_NOT_WORK_REGION     23.59498  0.0006503832   FALSE TRUE
##  6:            BASEMENTAREA_AVG     58.74502  1.2292243204   FALSE TRUE
##  7:                LANDAREA_AVG     82.53968  1.1469508408   FALSE TRUE
##  8:           NONLIVINGAREA_AVG    107.57326  1.0698804270   FALSE TRUE
##  9:           BASEMENTAREA_MODE     61.70260  1.2490610092   FALSE TRUE
## 10:               LANDAREA_MODE     83.90865  1.1586577391   FALSE TRUE
## 11:          NONLIVINGAREA_MODE    140.72537  1.0819125170   FALSE TRUE
## 12:           BASEMENTAREA_MEDI     55.31734  1.2266227875   FALSE TRUE
## 13:               LANDAREA_MEDI     80.80711  1.1576821642   FALSE TRUE
## 14:          NONLIVINGAREA_MEDI    113.08720  1.0806117505   FALSE TRUE
## 15:             FLAG_DOCUMENT_2  23653.69231  0.0006503832   FALSE TRUE
df_corr = cor(dt1_num2, use = "pairwise.complete.obs")
hc = findCorrelation(df_corr, cutoff=0.80)
hc = sort(hc)
dt1_num3 = as.data.frame(dt1_num2)[,-c(hc)]
rm_col_hc <- setdiff(colnames(dt1_num2),colnames(dt1_num3))
rm_col_hc
##  [1] "AMT_ANNUITY"                  "AMT_GOODS_PRICE"             
##  [3] "FLAG_EMP_PHONE"               "REGION_RATING_CLIENT_W_CITY" 
##  [5] "REG_REGION_NOT_WORK_REGION"   "REG_CITY_NOT_WORK_CITY"      
##  [7] "APARTMENTS_AVG"               "BASEMENTAREA_AVG"            
##  [9] "YEARS_BEGINEXPLUATATION_AVG"  "YEARS_BUILD_AVG"             
## [11] "COMMONAREA_AVG"               "ELEVATORS_AVG"               
## [13] "ENTRANCES_AVG"                "FLOORSMAX_AVG"               
## [15] "FLOORSMIN_AVG"                "LANDAREA_AVG"                
## [17] "LIVINGAPARTMENTS_AVG"         "LIVINGAREA_AVG"              
## [19] "NONLIVINGAPARTMENTS_AVG"      "NONLIVINGAREA_AVG"           
## [21] "APARTMENTS_MODE"              "YEARS_BUILD_MODE"            
## [23] "LIVINGAREA_MODE"              "APARTMENTS_MEDI"             
## [25] "BASEMENTAREA_MEDI"            "YEARS_BEGINEXPLUATATION_MEDI"
## [27] "COMMONAREA_MEDI"              "ELEVATORS_MEDI"              
## [29] "ENTRANCES_MEDI"               "FLOORSMAX_MEDI"              
## [31] "FLOORSMIN_MEDI"               "LANDAREA_MEDI"               
## [33] "LIVINGAPARTMENTS_MEDI"        "LIVINGAREA_MEDI"             
## [35] "NONLIVINGAPARTMENTS_MEDI"     "NONLIVINGAREA_MEDI"          
## [37] "TOTALAREA_MODE"               "OBS_30_CNT_SOCIAL_CIRCLE"    
## [39] "DEF_30_CNT_SOCIAL_CIRCLE"

We removed all the columns identified as highly correlated and/or near-zero-variance, added the engineered predictors, and converted the categorical variables to dummy variables.
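The categorical-to-dummy step can be sketched with base R's `model.matrix()`; the encoding call used in the actual pipeline is not shown above, so the tiny data frame and column choices here are illustrative only, but the resulting names (e.g. `CODE_GENDERM`) match the dummy-variable naming seen later in `cols_to_keep`:

```r
# Hedged sketch of dummy-variable (one-hot/treatment) encoding on toy rows.
df <- data.frame(
  CODE_GENDER  = factor(c("M", "F", "F", "M")),
  FLAG_OWN_CAR = factor(c("Y", "N", "Y", "N")),
  AMT_CREDIT   = c(200000, 150000, 300000, 120000)
)
# "~ . - 1" expands the factors into indicator columns without an intercept
dummies <- model.matrix(~ . - 1, data = df)
colnames(dummies)
```

The first factor is fully expanded (both levels) while subsequent factors get k-1 treatment-coded columns; numeric columns pass through unchanged.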

dt1_preproc <- cbind(dt1_non_num_dum,dt1_num)

mv <- as.data.frame(apply(dt1_preproc, 2, function(col)sum(is.na(col))/length(col)))
colnames(mv)[1] <- "missing_values"
mv <- index_to_col(mv,'Column')
mv <- setDT(mv)[order (missing_values, decreasing = TRUE)]

ggplot(mv[1:40, ], aes(reorder(Column, missing_values), missing_values)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  coord_flip() +
  xlab('Columns') +
  ylab('Missing Value %')

dt1_preproc <- na.aggregate(dt1_preproc)

We then used Recursive Feature Elimination (RFE) to select a smaller set of variables and reduce computational cost going forward.

Recursive Feature Elimination

control <- rfeControl(functions = rfFuncs, method = "cv", number = 3)
trainctrl <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)

results <- rfe(as.data.frame(dt1_preproc_sample)[, -c(153)],
               as.data.frame(dt1_preproc_sample)[, c(153)],
               sizes = c(1:100),
               rfeControl = control,
               method = "rf",
               metric = "AUC",
               trControl = trainctrl)
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (3 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables RMSE Rsquared  MAE RMSESD RsquaredSD MAESD Selected
##          1 4064   0.1408 3357  35.36   0.015490 35.47         
##          2 3743   0.2572 3019  30.74   0.016769 45.54         
##          3 3479   0.3542 2830  46.82   0.025566 71.31         
##          4 3231   0.4542 2620  11.41   0.014291 48.68         
##          5 2898   0.5755 2340  72.13   0.015319 62.65         
##          6 2807   0.5807 2184  87.82   0.020661 99.44         
##          7 2818   0.5783 2212  75.45   0.017376 64.63         
##          8 2812   0.5804 2212  74.20   0.016274 65.11         
##          9 2735   0.6005 2095  71.01   0.016058 74.68         
##         10 2715   0.6067 2085  56.62   0.011686 56.60         
##         11 2687   0.6151 2067  73.00   0.017082 67.89         
##         12 2693   0.6124 2036  65.80   0.016328 73.92         
##         13 2696   0.6116 2045  60.57   0.014360 67.76         
##         14 2682   0.6159 2033  47.49   0.009743 48.93         
##         15 2690   0.6132 2024  46.25   0.011474 42.49         
##         16 2678   0.6167 2021  45.06   0.010037 40.99         
##         17 2671   0.6188 2026  53.48   0.012171 51.23         
##         18 2669   0.6196 2021  61.07   0.013225 54.66         
##         19 2661   0.6218 2020  62.67   0.014114 56.53         
##         20 2657   0.6235 2020  54.97   0.011844 55.71         
##         21 2650   0.6254 2013  61.02   0.013371 52.39         
##         22 2641   0.6282 2012  49.55   0.009886 43.01         
##         23 2634   0.6307 2011  50.53   0.009198 42.17         
##         24 2634   0.6304 2008  56.10   0.010650 50.61         
##         25 2631   0.6315 2010  54.94   0.010583 50.70         
##         26 2631   0.6316 2010  51.95   0.010157 47.85         
##         27 2633   0.6309 2009  55.38   0.011148 53.75         
##         28 2628   0.6326 2006  64.23   0.013452 59.92         
##         29 2630   0.6321 2011  57.55   0.011792 57.82         
##         30 2624   0.6336 2000  57.92   0.012394 58.29         
##         31 2616   0.6361 2000  61.99   0.012320 60.49         
##         32 2628   0.6327 2006  63.51   0.013601 63.74         
##         33 2628   0.6325 2009  68.97   0.015640 67.15         
##         34 2625   0.6338 2006  59.60   0.012973 62.75         
##         35 2621   0.6349 2008  69.60   0.014960 67.46         
##         36 2619   0.6352 2001  65.37   0.014276 61.51         
##         37 2620   0.6352 2003  69.09   0.015192 66.94         
##         38 2615   0.6367 1996  58.95   0.012172 63.43         
##         39 2617   0.6362 2000  60.28   0.012249 59.08         
##         40 2614   0.6370 1998  52.74   0.010478 52.14         
##         41 2611   0.6380 1998  61.35   0.012543 55.99         
##         42 2611   0.6379 1996  65.21   0.013467 56.19        *
##         43 2617   0.6362 2001  63.15   0.012902 56.78         
##         44 2613   0.6375 1998  63.06   0.012878 56.06         
##         45 2618   0.6358 2000  66.94   0.014301 58.91         
##         46 2611   0.6379 1997  61.21   0.012222 51.02         
##         47 2619   0.6356 2005  70.34   0.014979 60.60         
##         48 2621   0.6350 2003  71.22   0.014996 61.89         
##         49 2617   0.6362 2002  66.43   0.013624 55.39         
##         50 2620   0.6353 2005  68.16   0.014362 55.60         
##         51 2616   0.6363 1999  66.58   0.013774 57.71         
##         52 2618   0.6359 2000  71.13   0.015636 57.32         
##         53 2617   0.6363 2000  72.91   0.015630 61.88         
##         54 2618   0.6358 2000  71.87   0.015619 64.28         
##         55 2621   0.6352 2006  72.04   0.015820 63.18         
##         56 2616   0.6369 2003  65.12   0.013557 58.01         
##         57 2619   0.6357 2002  63.71   0.013607 55.07         
##         58 2623   0.6345 2009  63.88   0.013597 53.42         
##         59 2625   0.6338 2007  60.41   0.012600 48.83         
##         60 2626   0.6335 2008  65.62   0.013935 56.65         
##         61 2620   0.6354 2004  60.88   0.012429 49.38         
##         62 2621   0.6351 2007  67.15   0.014499 53.20         
##         63 2623   0.6345 2007  62.79   0.013840 50.32         
##         64 2622   0.6348 2007  66.28   0.014270 52.00         
##         65 2619   0.6356 2007  68.48   0.014806 56.23         
##         66 2618   0.6362 2005  69.88   0.015706 56.63         
##         67 2616   0.6368 2003  65.83   0.014198 55.90         
##         68 2618   0.6362 2005  62.17   0.013090 51.95         
##         69 2619   0.6357 2006  63.15   0.013844 56.82         
##         70 2619   0.6357 2007  65.46   0.014815 52.99         
##         71 2614   0.6372 2003  59.53   0.012807 52.19         
##         72 2612   0.6378 2000  58.60   0.013109 51.49         
##         73 2614   0.6373 2003  60.26   0.013081 56.32         
##         74 2620   0.6356 2006  66.28   0.014651 58.37         
##         75 2615   0.6369 2000  66.40   0.015123 58.54         
##         76 2614   0.6372 2002  76.75   0.017644 65.08         
##         77 2613   0.6378 2002  64.45   0.014467 57.25         
##         78 2621   0.6352 2005  73.00   0.016605 60.59         
##         79 2620   0.6356 2007  62.19   0.013770 54.50         
##         80 2623   0.6348 2007  62.16   0.013861 53.21         
##         81 2615   0.6372 2003  59.47   0.013216 49.43         
##         82 2618   0.6363 2008  60.58   0.013463 54.53         
##         83 2618   0.6363 2008  57.41   0.012670 52.62         
##         84 2621   0.6353 2007  64.15   0.014311 56.72         
##         85 2623   0.6347 2008  63.85   0.014651 56.11         
##         86 2616   0.6370 2003  60.25   0.012473 53.03         
##         87 2621   0.6352 2007  55.66   0.012055 49.01         
##         88 2617   0.6366 2007  59.98   0.012853 51.22         
##         89 2621   0.6357 2008  58.24   0.011931 51.65         
##         90 2622   0.6351 2010  58.98   0.012051 48.56         
##         91 2617   0.6368 2006  58.76   0.012735 51.74         
##         92 2618   0.6365 2007  58.49   0.012773 49.33         
##         93 2619   0.6360 2006  57.09   0.011955 47.58         
##         94 2616   0.6369 2006  56.37   0.012014 50.23         
##         95 2617   0.6367 2006  65.79   0.014677 55.18         
##         96 2617   0.6367 2008  61.93   0.013284 54.38         
##         97 2618   0.6364 2007  62.62   0.013128 52.62         
##         98 2614   0.6377 2006  57.79   0.011503 48.28         
##         99 2616   0.6372 2005  61.26   0.013096 48.19         
##        100 2616   0.6370 2005  63.96   0.013596 50.85         
##        184 2617   0.6369 2008  58.15   0.011469 48.10         
## 
## The top 5 variables (out of 42):
##    EXT_SOURCE_1, DAYS_REGISTRATION, DAYS_ID_PUBLISH, CNT_CHILDREN, ORGANIZATION_TYPEXNA
cols_to_keep <- c('FLAG_OWN_CARN', '`ORGANIZATION_TYPEIndustry: type 1`',
                  'DAYS_ID_PUBLISH', 'SK_ID_CURR', 'REG_CITY_NOT_LIVE_CITY',
                  'YEARS_BEGINEXPLUATATION_MODE', 'COMMONAREA_MODE',
                  'FLOORSMAX_MODE', 'LIVINGAPARTMENTS_MODE', 'YEARS_BUILD_MEDI',
                  'CODE_GENDERM', 'OCCUPATION_TYPEWaiters/barmen staff',
                  'TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
                  'CNT_CHILDREN')
dt1_preproc_sample <- as.data.frame(dt1_preproc_sample)[, (colnames(dt1_preproc_sample) %in% cols_to_keep)]

dt1_preproc <- as.data.frame(dt1_preproc)[, (colnames(dt1_preproc) %in% cols_to_keep)]

cols_to_keep
##  [1] "FLAG_OWN_CARN"                       "`ORGANIZATION_TYPEIndustry: type 1`"
##  [3] "DAYS_ID_PUBLISH"                     "SK_ID_CURR"                         
##  [5] "REG_CITY_NOT_LIVE_CITY"              "YEARS_BEGINEXPLUATATION_MODE"       
##  [7] "COMMONAREA_MODE"                     "FLOORSMAX_MODE"                     
##  [9] "LIVINGAPARTMENTS_MODE"               "YEARS_BUILD_MEDI"                   
## [11] "CODE_GENDERM"                        "OCCUPATION_TYPEWaiters/barmen staff"
## [13] "TARGET"                              "EXT_SOURCE_1"                       
## [15] "EXT_SOURCE_2"                        "EXT_SOURCE_3"                       
## [17] "CNT_CHILDREN"

These are the final features we selected for our model.

library(caret)
library(rsample)

unique(dt1_preproc$TARGET)
## [1] 1 0
dt1_preproc$TARGET <- ifelse(dt1_preproc$TARGET == 0, 'Yes', 'No')
dt1_preproc$TARGET <- as.factor(dt1_preproc$TARGET)

# Create training and testing data partitions
inTrain <- createDataPartition(dt1_preproc$TARGET, p = 0.8, list = FALSE)
dtTrain <- dt1_preproc[inTrain, ]
dtTest <- dt1_preproc[-inTrain, ]
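What `createDataPartition()` does under the hood can be sketched in base R: an 80/20 split stratified by class, so both partitions keep roughly the same Yes/No mix. The labels below are synthetic, mimicking the roughly 92/8 class balance of the real target:

```r
# Hedged sketch of a class-stratified 80/20 split on toy labels.
set.seed(123)
target <- factor(sample(c("Yes", "No"), 1000, replace = TRUE,
                        prob = c(0.92, 0.08)))
train_idx <- unlist(lapply(split(seq_along(target), target), function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))   # sample 80% within each class
}))
train <- target[train_idx]
test  <- target[-train_idx]
round(c(train = mean(train == "Yes"), test = mean(test == "Yes")), 2)
```

With this much imbalance, stratification matters: a plain random split could leave the small "No" class under-represented in one partition.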

3 Model Building

traincntrl <- trainControl(method = 'repeatedcv',
                           number = 5,
                           repeats = 2,
                           classProbs = TRUE,
                           sampling = "down",
                           summaryFunction = twoClassSummary)

Support Vector Machine (SVM Radial)

svmFit <- train(TARGET ~.,
                data = subset_train_data,
                method = 'svmRadial',
                preProc = c('center','scale'),
                tuneLength = 3,
                trControl = traincntrl)

svmFit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 24601 samples
##    15 predictor
##     2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 19680, 19681, 19681, 19681, 19681, 19680, ... 
## Addtional sampling using down-sampling prior to pre-processing
## 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.7300545  0.6795065  0.6716560
##   0.50  0.7269650  0.6706891  0.6687155
##   1.00  0.7216793  0.6588592  0.6741764
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06342473
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.06342473 and C = 0.25.
plot(svmFit, scales = list(x=list(log =2)))

predictClasses <- predict(svmFit, newdata = dtTest)
confusionMatrix(predictClasses, dtTest$TARGET)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No   3212 18543
##        Yes  1753 37994
##                                           
##                Accuracy : 0.67            
##                  95% CI : (0.6663, 0.6737)
##     No Information Rate : 0.9193          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1255          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.64693         
##             Specificity : 0.67202         
##          Pos Pred Value : 0.14764         
##          Neg Pred Value : 0.95590         
##              Prevalence : 0.08073         
##          Detection Rate : 0.05223         
##    Detection Prevalence : 0.35373         
##       Balanced Accuracy : 0.65947         
##                                           
##        'Positive' Class : No              
## 

In employing the Support Vector Machine (SVM) with a radial basis function kernel, the model was trained on 24,601 samples with 15 predictor variables, after centering and scaling. The tuning parameters, the cost (C) and the kernel width (sigma), were optimized through cross-validated resampling; the selected model, with C = 0.25 and sigma = 0.0634, achieved a cross-validated ROC of 0.73. Evaluation on the (notably imbalanced) test set yielded an accuracy of 67%, with a sensitivity of 64.69% and a specificity of 67.20%, reflecting a trade-off between correctly flagging defaults and minimizing false positives. Because the no-information rate is 91.9%, raw accuracy is a weak summary here; the balanced accuracy of 65.95% is the more informative figure.
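The headline SVM metrics can be recomputed directly from the confusion matrix above, which makes the sensitivity/specificity trade-off and the role of balanced accuracy explicit:

```r
# Recomputing the SVM test-set metrics from the confusion matrix above.
#              Reference
# Prediction     No   Yes
#        No    3212 18543
#        Yes   1753 37994
tp <- 3212; fn <- 1753    # "No" (the rarer class) is treated as positive here
fp <- 18543; tn <- 37994
sensitivity <- tp / (tp + fn)                    # 0.64693
specificity <- tn / (tn + fp)                    # 0.67202
accuracy    <- (tp + tn) / (tp + fn + fp + tn)   # 0.67
balanced    <- (sensitivity + specificity) / 2   # 0.65947
round(c(sens = sensitivity, spec = specificity,
        acc = accuracy, bal = balanced), 4)
```

These reproduce the caret output to rounding, confirming that the 67% accuracy is driven largely by the majority class.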

KNN

The k-Nearest Neighbors (KNN) algorithm was employed for binary classification; it assumes that similar instances lie close together in feature space.

knnFit <- train(TARGET ~.,
                data = subset_train_data,
                method = "knn",
                preProc = c("center", "scale"),
                metric = "Accuracy",
                tuneGrid = data.frame(.k = 1:20),
                trControl = traincntrl)

knnFit$results
##     k       ROC      Sens      Spec       ROCSD     SensSD      SpecSD
## 1   1 0.5787709 0.5848468 0.5726951 0.018295566 0.03241870 0.007727375
## 2   2 0.6068607 0.5772831 0.5716339 0.009981657 0.02150961 0.005909824
## 3   3 0.6345142 0.6223631 0.5945169 0.013106437 0.02059142 0.010046121
## 4   4 0.6414925 0.6180924 0.5919965 0.014459362 0.02477989 0.007446866
## 5   5 0.6559403 0.6289021 0.6113862 0.010888943 0.02321467 0.014508864
## 6   6 0.6567551 0.6276395 0.6018351 0.012154432 0.01276619 0.014003844
## 7   7 0.6679057 0.6442629 0.6126907 0.012070177 0.01641820 0.011643629
## 8   8 0.6726300 0.6437553 0.6139288 0.011869174 0.02287752 0.007904402
## 9   9 0.6782973 0.6457736 0.6214681 0.013764469 0.02351186 0.011484245
## 10 10 0.6827816 0.6538220 0.6183728 0.012353530 0.01690464 0.010838505
## 11 11 0.6825743 0.6497956 0.6307097 0.013536951 0.02605246 0.010819747
## 12 12 0.6902388 0.6633849 0.6289410 0.017913308 0.02951769 0.015877317
## 13 13 0.6888980 0.6470299 0.6343577 0.015250052 0.02320823 0.014425426
## 14 14 0.6913659 0.6598572 0.6283440 0.015418786 0.03006231 0.010133325
## 15 15 0.6914607 0.6636324 0.6313730 0.017881663 0.02928959 0.016104205
## 16 16 0.6954432 0.6616299 0.6269069 0.016671694 0.02096529 0.015075127
## 17 17 0.6996474 0.6769768 0.6304002 0.015738800 0.02403170 0.011118322
## 18 18 0.6946415 0.6543258 0.6377183 0.012996049 0.02287255 0.020672140
## 19 19 0.6982778 0.6560814 0.6403051 0.013811924 0.02607474 0.013130324
## 20 20 0.6983789 0.6656646 0.6372098 0.011480325 0.02545053 0.013973171
predictions_knn <- predict(knnFit, newdata = dtTest)
conf_matrix <- confusionMatrix(predictions_knn, dtTest$TARGET)
conf_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No   3176 20842
##        Yes  1789 35695
##                                           
##                Accuracy : 0.632           
##                  95% CI : (0.6282, 0.6358)
##     No Information Rate : 0.9193          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0985          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.63968         
##             Specificity : 0.63136         
##          Pos Pred Value : 0.13223         
##          Neg Pred Value : 0.95227         
##              Prevalence : 0.08073         
##          Detection Rate : 0.05164         
##    Detection Prevalence : 0.39052         
##       Balanced Accuracy : 0.63552         
##                                           
##        'Positive' Class : No              
## 

The model was trained on a subset of the data after centering and scaling. Values of k (number of neighbors) from 1 to 20 were explored, and the optimal k = 17 was selected on cross-validated ROC. On the test set the model achieved an accuracy of 63.2%, with a sensitivity of 63.97%, a specificity of 63.14%, and a balanced accuracy of 63.55%, indicating a moderate ability to correctly identify both positive and negative cases.

Logistic Regression

logisticReg <- train(TARGET ~.,
                     data = dtTrain,
                     method = 'glm',
                     metric="Accuracy",
                     trControl = traincntrl)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
predictions <- predict(logisticReg, newdata = dtTest, type = "raw")
confusionMatrix(predictions, dtTest$TARGET)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No   3301 18750
##        Yes  1664 37787
##                                           
##                Accuracy : 0.6681          
##                  95% CI : (0.6643, 0.6718)
##     No Information Rate : 0.9193          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1297          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.66485         
##             Specificity : 0.66836         
##          Pos Pred Value : 0.14970         
##          Neg Pred Value : 0.95782         
##              Prevalence : 0.08073         
##          Detection Rate : 0.05367         
##    Detection Prevalence : 0.35854         
##       Balanced Accuracy : 0.66661         
##                                           
##        'Positive' Class : No              
## 
roc_curve <- roc(dtTest$TARGET, as.numeric(predictions == "Yes"))  # Assuming "Yes" is the positive class
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for Binary Classification Model", col = "blue")

Logistic Regression was employed for binary classification, offering a balance between speed and accuracy relative to the other techniques. It achieved an accuracy of 66.81% on the test set, with a sensitivity of 66.49%, a specificity of 66.84%, and a balanced accuracy of 66.66%. The Kappa value of 0.1297 suggests only fair agreement beyond chance, largely a consequence of the class imbalance. An ROC curve was also plotted to assess the model's ability to discriminate between classes; note that it was computed from hard class labels rather than predicted probabilities, which limits the curve to a single operating point.
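Scoring with predicted probabilities (e.g. `predict(..., type = "prob")`) instead of hard labels yields a full ROC curve and a meaningful AUC. A dependency-free sketch of the AUC via the rank (Wilcoxon/Mann-Whitney) formula, on synthetic scores rather than the actual model output:

```r
# Hedged sketch: AUC from predicted scores via the rank (Wilcoxon) statistic.
auc_rank <- function(scores, labels) {   # labels: TRUE = positive class
  r <- rank(scores)                      # midranks handle ties
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
set.seed(99)
labels <- runif(2000) < 0.08                              # ~8% positives
scores <- ifelse(labels, rnorm(2000, 1), rnorm(2000, 0))  # positives score higher
round(auc_rank(scores, labels), 2)
```

A perfectly separating scorer gives AUC = 1, a random one 0.5; feeding 0/1 predictions into `pROC::roc()` as above can only trace one intermediate point between those extremes.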

4 Model Evaluation and Comparison

  1. Support Vector Machine (SVM): using a radial basis function kernel (sigma = 0.06342473, C = 0.25), the SVM achieved 67% accuracy on the test set, with a sensitivity of 64.69% and a specificity of 67.20%, making it a strong candidate for anticipating and streamlining loan approval outcomes.

  2. k-Nearest Neighbors (KNN): tuned over 20 values of k, KNN achieved 63.2% accuracy on the test set, with a sensitivity of 63.97% and a specificity of 63.14%; a viable but weaker approach for discerning patterns in loan approval outcomes.

  3. Logistic Regression: known for its computational efficiency, logistic regression achieved 66.81% accuracy on the test set, with a sensitivity of 66.49% and a specificity of 66.84%, striking a balance between speed and accuracy compared to the more complex models.

5 Conclusion and Recommendations

Each model presents strengths and trade-offs in accuracy and computational efficiency.

SVM, with its 67% accuracy, stands out for robust predictive modeling, though computational intensity should be considered.

KNN, while less accurate, offers an alternative perspective, potentially uncovering unique patterns.

Logistic Regression, with its 66.81% accuracy, is an attractive option due to its balance between speed and accuracy.

Further refinement and the exploration of ensemble approaches could enhance predictive power. We recommend weighing each model's accuracy against its computational cost to arrive at a balanced predictive model, one that minimizes the risks associated with loan approvals while improving decision-making for lending institutions, in line with our overarching goal of transforming financial practices and fostering a more secure and informed lending experience.

Our journey extends beyond data; it’s about transforming the landscape of financial decision-making, fostering an enriched lending experience for all stakeholders.